-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
URL-encode partition field names in file locations #1457
URL-encode partition field names in file locations #1457
Conversation
# Test that special characters are URL-encoded | ||
( | ||
[PartitionField(source_id=15, field_id=1001, transform=IdentityTransform(), name="special#string#field")], | ||
["special string"], | ||
Record(**{"special#string#field": "special string"}), # type: ignore | ||
"special%23string%23field=special%20string", | ||
f"""CREATE TABLE {identifier} ( | ||
`special#string#field` string | ||
) | ||
USING iceberg | ||
PARTITIONED BY ( | ||
identity(`special#string#field`) | ||
) | ||
""", | ||
f"""INSERT INTO {identifier} | ||
VALUES | ||
('special string') | ||
""", | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I thought this test would work but it fails, and I'm not sure why yet.
It fails
assert expected_hive_partition_path_slice in spark_path_for_justification
with
E AssertionError: assert 'special%23string%23field=special%20string' in 's3://warehouse/default/test_table/data/special#string#field=special+string/00000-57-0756f620-2b2e-4ffa-97f6-625343525c9b-00001.parquet'
and it fails
assert spark_partition_for_justification == expected_partition_record
with
E AssertionError: assert Record[specia...ecial string'] == Record[specia...ecial string']
E - Record[special#string#field='special string']
E + Record[special_x23string_x23field='special string']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
special_x23string_x23field
is related to #590
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kevinjqliu. And the first failure is because the Iceberg version on spark
was before apache/iceberg#10329, so it's not URL-encoded (I think).
Given this, I've disabled justification with a message similar to other tests here where behaviour differs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this is sufficient (given we're testing PartitionKey
's to_path
, it felt natural but I'm unsure)?
If not, happy to be pointed to somewhere where I can add an integration test similar to the one shown in the issue. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apache/iceberg@795fea9
should be available starting in 1.6.0
.
It looks like pyspark is using an older library version, can you add this change as see if the test pass?
#1462
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this is sufficient (given we're testing PartitionKey's to_path, it felt natural but I'm unsure)?
yea we're testing partition_to_path
, maybe add a test in tests/table/test_partitioning.py
, which is not part of the integration test suite.
The integration test is a nice to have though. lets see if upgrading the iceberg library helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this suggestion - bumping made the path test fail because quote
was being used instead of quote_plus
for encoding (the Java implementation encodes spaces to +
which quote
doesn't do).
For consistency, I've made the change to match Java behaviour, but can revert that if consistency isn't so important - what do you think?.
A unit test sounds good (and integration for justification checks would be great).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! thats why we like integration tests :)
let me merge #1462 first, and then we can rebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the Record
comparison still fails because of the non-optional sanitisation-transformation described in apache/iceberg#10120. And, as it stands, the provided Record
param is used to check key.partition
so can't be changed because that should be unsanitised, IIUC.
Think some test rewiring might be required - maybe providing a separate record param that's by default the other one, just for this justification check, but I wonder if we're really just testing spark behaviour then.
BTW #1462 is merged, could you rebase this PR? |
999d753
to
f5a35de
Compare
spec_id=3, | ||
) | ||
|
||
record = Record(**{"my#str%bucket": "my+str", "other str+bucket": "( )", "my!int:bucket": 10}) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy complains here and elsewhere but I think it's fine
|
||
# Both partition names fields and values should be URL encoded, with spaces mapping to plus signs, to match the Java | ||
# behaviour: https://github.com/apache/iceberg/blob/ca3db931b0f024f0412084751ac85dd4ef2da7e7/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L198-L204 | ||
assert spec.partition_to_path(record, schema) == "my%23str%25bucket=my%2Bstr/other+str%2Bbucket=%28+%29/my%21int%3Abucket=10" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cross-checked with Java implementation (integration tests will do this eventually), in particular WRT to ' '
and '+'
encoding. It is consistent.
Done, @kevinjqliu. Fails due to #1457 (comment) but will think over it. FYI, am away for a little bit now so will pick this back up shortly in the new year! (Feel free to take over if v urgent 😄) |
Thanks for the PR! I've dug into the test failure a bit. Heres what I found. There's a subtle difference between
You can verify this by looking up the table partition spec.
The integration test assumes that the value for After
|
Thanks a lot for this explanation and suggestion @kevinjqliu! It sounds good. Had some time so I've made this change so tests pass - using |
@@ -234,9 +234,11 @@ def partition_to_path(self, data: Record, schema: Schema) -> str: | |||
partition_field = self.fields[pos] | |||
value_str = partition_field.transform.to_human_string(field_types[pos].field_type, value=data[pos]) | |||
|
|||
value_str = quote(value_str, safe="") | |||
value_str = quote_plus(value_str, safe="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It defaults to utf-8
, so that's good 👍
field_str = quote_plus(partition_field.name, safe="") | ||
field_strs.append(field_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, I would just collapse these:
field_str = quote_plus(partition_field.name, safe="") | |
field_strs.append(field_str) | |
field_strs.append(quote_plus(partition_field.name, safe="")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smaheshwar-pltr for picking this up, and thanks for @kevinjqliu the review. I'll leave this one open in case Kevin has any further comments.
VALUES | ||
('special string') | ||
""", | ||
lambda name: name.replace("#", "_x23").replace("+", "_x2B"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we just reuse the make_compatible_name
function here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for working on this @smaheshwar-pltr
A few nit comments left on this PR but not blocking. Going to merge this PR as is and we can deal with nit comment as a followup
Closes #1458
Closes #175